Monitoring Complex Data Feeds Through Ensemble Testing

ABSTRACT

Managing and monitoring multiple complex data feeds is a major challenge for data mining tasks in large corporations and scientific endeavors alike. The invention describes an effective method for flagging abnormalities in data feeds using an ensemble of statistical tests that may be used on complex data feeds. The tests in the ensemble are chosen such that the speed and ability to deliver real time decisions are not compromised.

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a continuation of co-pending U.S. application Ser. No. 11/275,395 entitled “Monitoring Complex Data Feeds Through Ensemble Testing” filed Dec. 29, 2005, which is expressly incorporated in its entirety herein by reference.

FIELD OF THE INVENTION

The present invention relates to managing and monitoring multiple complex data feeds to discover abnormalities using an ensemble of statistical tests.

BACKGROUND OF THE INVENTION

As the ability to collect, transmit and store data grows, the challenges of managing, cleaning and mining this data grow. Typical data mining applications draw from multiple, inter-dependent feeds, originating from multiple and varied sources. Some applications log well over a terabyte of incoming data a month from hundreds of source feeds containing thousands of files. Most known solutions for managing data feeds either rely on ad hoc methods tailored to a particular application or address the problem superficially using limited functionality offered by commercial database systems or hastily marshaled in-house scripts.

Manual monitoring of feeds and tasks of this size is untenable as well as undesirable due to the potential for introducing human errors. Also, it is important to respond quickly as there is a short window during which feed files that have failed in transmission or otherwise may be retransmitted. Therefore, if an abnormality is noticed that is outside expectations, it needs to be flagged immediately for further investigation and remediation. For example, it may be known that a particular data feed should send a particular quantity of files at a particular time. If fewer files than expected are received, a timely request may be made to retransmit the files to ensure that all files expected are received.

The use of statistical tests to monitor the quality of the data feeds is known in the art but current applications do not provide for use of a flexible and efficient method or system that can cover a wide variety of statistical distributions and anomalies. Current data mining applications use tests based on a single attribute (univariate) rather than multiple attributes and are only capable of flagging very particular types of abnormalities. These univariate tests may not provide the user with an abundance of confidence as individual tests may be limited in scope and application. Such known tests include Hampel bounds, trimmed means and three-sigma limit type tests.

In addition, one drawback of current data monitoring and mining applications is that users have found it difficult to visualize the results or indications of discovered abnormalities in the data feeds. A mechanism for displaying the results of various statistical tests to users who interpret such results would be beneficial.

Therefore, there is a need in the art for a method of managing and monitoring multiple complex data feeds in a computationally lightweight manner to discover abnormalities. The method should provide an efficient way to alert users to the abnormalities so that a response can be rapidly deployed.

SUMMARY

Aspects of the present invention overcome problems and limitations of the prior art by providing a method for monitoring and managing data feeds using a statistical ensemble of tests. In an aspect of the invention, an ensemble of tests is chosen such that the speed and ability to deliver real time decisions are not compromised. Furthermore, the use of multiple tests allows for the detection of an assortment of potential anomalies and provides a user with confidence that the detection is valid as the detection is based on a multitude of statistical tests.

In an exemplary aspect of the invention, upper and lower error bounds are determined for an ensemble of tests. The upper and lower bounds may be based on historical data or expert knowledge. The ensemble of tests is applied to the data feeds for detection of abnormalities. The individual tests comprising the ensemble of tests may be assigned weights based on validation from domain experts or historical data. The results of the monitoring may be displayed on a switchboard to users supervising the system.

The details of these and other embodiments of the present invention are set forth in the accompanying drawings and the description below. Other features and advantages of the invention will be apparent from the description and drawings, and from the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention may take physical form in certain parts and steps, embodiments of which will be described in detail in the following description and illustrated in the accompanying drawings.

FIG. 1 illustrates a Hampel bounds test for use in an ensemble of monitoring tests that may be used in accordance with an aspect of the invention.

FIG. 2 illustrates a quantiles test for use in the ensemble of monitoring tests that may be used in accordance with an aspect of the invention.

FIG. 3 illustrates a 5% trimmed mean bounds test for use in an ensemble of monitoring tests that may be used in accordance with an aspect of the invention.

FIG. 4 illustrates a three sigma bounds test for use in an ensemble of monitoring tests that may be used in accordance with an aspect of the invention.

FIG. 5 illustrates a trimmed average test for use in an ensemble of monitoring tests that may be used in accordance with an aspect of the invention.

FIG. 6 illustrates a 3 sigma average test for use in an ensemble of monitoring tests that may be used in accordance with an aspect of the invention.

FIG. 7 illustrates a switchboard for displaying the ensemble test results in accordance with an aspect of the invention.

FIG. 8 shows a diagram of a system and network that may be used to implement aspects of the invention.

FIG. 9 illustrates a flow diagram for monitoring multiple data feeds for abnormalities in accordance with an aspect of the invention.

DETAILED DESCRIPTION

Exemplary Operating Environment

FIG. 8 shows a diagram of a computer system and network that may be used to implement aspects of the invention. A plurality of computers, such as workstations 102 and 104, may be coupled to a computer 112, via networks 108, 128, and 118. Computers 112, 114, and 116 may be coupled to a network 128 through network 118. Computers 112, 114, and 116 along with workstations 102 and 104 may provide multiple complex data feeds to network 128. Similarly, data gathering systems 120 and 123 may collect data and transmit that data directly or indirectly (via Internet 198) to network 128. The data gathering systems 120 and 123 may be connected to a host of various devices such as telephone 181, cellular phone 182, PDA 183, handheld device 184, and ATM device 185. Those skilled in the art will realize that other special purpose and/or general purpose computer devices may also be connected to the data gathering systems 120 and 123. Such devices may include credit card terminals, handheld devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, networked PCs, minicomputers, mainframe computers, and the like.

One or more of the computer devices shown in FIG. 8 may include a variety of interface units and drives for reading and writing data or files. One skilled in the art will appreciate that networks 108, 118, and 128 are for illustration purposes and may be replaced with fewer or additional computer networks. One or more networks may be in the form of a local area network (LAN) that has one or more of the well-known LAN topologies and may use a variety of different protocols, such as Ethernet. One or more of the networks may be in the form of a wide area network (WAN), such as the Internet. Computer devices and other devices may be connected to one or more of the networks via twisted pair wires, coaxial cable, fiber optics, radio waves or other media.

The term “network” as used herein and depicted in the drawings should be broadly interpreted to include not only systems in which remote storage devices are coupled together via one or more communication paths, but also stand-alone devices that may be coupled, from time to time, to such systems that have storage capability. Consequently, the term “network” includes not only a “physical network” but also a “content network,” which is comprised of the data, attributable to a single entity, which resides across all physical networks.

Network 128 may include monitoring hardware 190 to monitor the data feeds from the above numerous sources. The monitoring hardware 190 may include a processor, memory and other conventional computer components and may be programmed with computer-executable instructions to communicate with other computer devices.

Exemplary Embodiments

A method of determining outliers and abnormalities in complex data feeds through use of an ensemble of statistical tests is illustrated in the below described aspects of the invention. Abnormalities, as used in this application and claims, refer to unexpected behavior discovered during the monitoring process, for example, failure to send files when expected. The number of statistical tests used in the ensemble of statistical tests may vary based upon the nature of the monitoring tasks. The invention provides a simple methodology for monitoring and analyzing complex, massive, multivariate data feeds. The technique is fast and effective. Additionally, the method may also utilize historical data and knowledge of experts to determine a weighting or ranking scheme for the selected tests.

In an aspect of the invention, bounds are determined to establish a baseline from which outliers and abnormalities may be determined. The bounds selected may be based upon the different tests used in the ensemble of tests. Based on the determined bounds, an alert may be generated when the data is determined to be a certain threshold from the baseline.

A method of computing an alert in accordance with an aspect of the invention may consist of testing a decision criterion in the form:

LB≦(T(X)−C)/S≦UB   (Equation 1)

where
    LB represents a lower tolerance bound,
    UB represents an upper tolerance bound,
    T(X) is a statistic (an estimate such as the mean) computed from the data X,
    C is a shift or offset to account for typical values of T(X), and
    S is a scaling factor to account for the spread in the values of the statistic T(X).

The above equation may represent baseline parameters of a particular test. Different definitions of the bounds, the shift and scale may give rise to different types of tests. These are non-parametric estimates so that the bounds may have a consistent meaning irrespective of the underlying probability distribution of data X.
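The decision criterion of Equation 1 can be sketched in a few lines of Python. This is a minimal illustration only, not the implementation of the invention: the `Baseline` container and the `is_alert` function are hypothetical names introduced here, and the zero-scale guard is an added assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Baseline:
    """Baseline parameters of one test (hypothetical container)."""
    lb: float  # LB, lower tolerance bound
    ub: float  # UB, upper tolerance bound
    c: float   # C, shift or offset
    s: float   # S, scaling factor


def is_alert(t_x: float, baseline: Baseline) -> bool:
    """Return True when T(X) fails LB <= (T(X) - C) / S <= UB."""
    if baseline.s == 0:
        # Assumption: a zero scale makes the criterion undefined.
        raise ValueError("scaling factor S must be non-zero")
    standardized = (t_x - baseline.c) / baseline.s
    return not (baseline.lb <= standardized <= baseline.ub)


# Example: a statistic six scale units above the offset is flagged.
print(is_alert(10.0, Baseline(lb=-3.0, ub=3.0, c=4.0, s=1.0)))
```

Each test in the ensemble then differs only in how it fills in the four baseline parameters and which statistic T(X) it computes.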

The baseline parameters may be computed using either a gold standard data set or using historical data. In an alternative embodiment, the baseline parameters being used to calculate the bounds may be determined by experts using experience with the data being monitored. For illustrative purposes in the described examples, the baseline parameters for each test were calculated using three months of historical data. Those skilled in the art will realize that any time period of historical data may be used and that three months of historical data is only one illustrative example. In addition, a particular time period relevant to the data streams being monitored may also be selected. In addition, an appropriate time window may be based on domain knowledge of the particular data feeds. In the absence of such knowledge, recent historical data may be used to estimate the frequency of significant shifts in the distribution of the data.

In an exemplary embodiment of the invention, the baseline parameters may be computed for each of the 24 hours of the day using the three-month historical data. That is, if Y=Y(H)_(i), i=1, . . . , 90 represents all the data collected during hour H of the day during the three months (for example, assuming there are 90 days in three months), then

    LB(H)=LB(Y(H)_(i), i=1, . . . , 90)
    UB(H)=UB(Y(H)_(i), i=1, . . . , 90)
    C(H)=C(Y(H)_(i), i=1, . . . , 90)
    S(H)=S(Y(H)_(i), i=1, . . . , 90)

where H=1, . . . , 24 represents the hours of the day. Weekdays may be treated separately from the weekends because the underlying domain exhibits different behavior in each of those cases.
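A lookup table of per-hour baseline parameters might be built along the following lines. This is a sketch under stated assumptions: the history is taken to be a list of (hour, value) pairs, the function names `hourly_baselines` and `mean_std` are hypothetical, and the mean and standard deviation estimator is only a placeholder for whichever test's definitions of LB, UB, C and S are in use.

```python
from collections import defaultdict
from statistics import mean, stdev


def hourly_baselines(history, estimator):
    """Group historical (hour, value) pairs by hour of day and apply an
    estimator to each group, giving one set of parameters per hour H."""
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)
    return {hour: estimator(values) for hour, values in sorted(by_hour.items())}


def mean_std(values):
    """Placeholder estimator (assumption): LB=-3, UB=3, C=mean, S=std dev."""
    return {"LB": -3.0, "UB": 3.0, "C": mean(values), "S": stdev(values)}


# Tiny illustrative history; a real table would cover H = 1, ..., 24
# with roughly 90 values per hour.
history = [(1, 10.0), (1, 12.0), (1, 14.0), (2, 100.0), (2, 100.0), (2, 103.0)]
table = hourly_baselines(history, mean_std)
print(table[1]["C"], table[2]["C"])
```

To treat weekdays separately from weekends, the grouping key could be extended to a pair such as (is_weekend, hour) without changing the rest of the sketch.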

To test a given hour H_(t) for abnormalities, one may compute the test statistic T(X) using the data X_(1), X_(2), . . . , X_(n) that was accumulated in the test hour H_(t) and compare it to the baseline parameters for the corresponding hour. An alert may be issued if the test statistic T(X) for the hour H_(t) being tested fails to satisfy the decision criterion in Equation 1, where one uses baseline parameters from the lookup table for the corresponding hour H_(t). In the remaining discussion and for ease of understanding, the H_(t) notation has been dropped and as such the comparison between the test statistic T(X) and baseline parameters in the decision criterion for the below examples is always on an hourly basis, between the corresponding hours.

In an aspect of the invention, various tests may be selected based upon the nature of the monitoring task. For example, the goal may be to isolate outliers (unusual readings), build representative summaries, or create data extracts to feed other applications such as visualization software. Based on these criteria, different combinations of tests may be selected.

The tests used may be simple nonparametric tests that use error bounds to identify outliers. A variety of tests ranging from those based on Hampel bounds and trimmed means, to the classical three-sigma limits for averages may be used. For example, the Hampel and trimmed mean bounds tests are robust to contamination and are not influenced by outliers. Robustness and breakdown point (the amount of data that can be corrupted without influencing the estimator) are important concepts in statistics that have been researched exhaustively to build robust estimators and tests. The three-sigma tests are familiar to those persons skilled in the art.

Choosing a suite of such tests helps to customize the ensemble to a desired level of sensitivity. The 3-sigma tests are sensitive to outliers and can be dramatically changed by a single aberrant observation. On the other hand, the tests based on Hampel and trimmed mean bounds are insensitive to significant changes so that they do not reflect underlying shifts in distributions that are relatively subtle, until a dramatic shift has occurred. The combination of such tests as described in the current description provides an improved array of abnormality detection for the variety of processes that generate the data.

In addition, though most of the tests herein described are univariate, the concept of statistical ensembles can easily incorporate multivariate tests like Hotelling's T² for detecting differences as well as temporal models if needed. (For additional prior art resources regarding detecting differences see: R. L. Mason, C. Champ, N. Tracy, S. Wierda, and J. Young. Assessment of multivariate process control techniques. Journal of Quality Technology, 29:140-143, 1997. For additional prior art information on temporal models see: G. Box, G. M. Jenkins, and G. Reinsel, Time Series Analysis: Forecasting & Control. Prentice Hall, 1994.)

Furthermore, one may incorporate recent tests for change detection in multi-dimensional data feeds for more dynamic data. (For additional prior art information see: D. Kifer, S. Ben-David, and J. Gehrke. Detecting change in data streams. In VLDB Conference, 2004.)

In an aspect of the invention, the Hampel identifier test may be used as one of the statistical tests. The Hampel test is a nonparametric test that is based on robust estimates of the center and scale, the Median and the Median Absolute Deviation, and its asymptotic behavior. Robustness implies stability with respect to extreme outliers that may occur. (For additional prior art information see: P. J. Huber. Robust Statistics. Wiley, New York, 1981.)

Therefore, the Hampel test offers protection against flagging alerts precipitously based on a few extreme observations. As an example, the baseline parameters for the Hampel test may include:

    LB=−3, UB=3
    C=Median
    S=1.4826*(Median of |T(X)−C|)

In the above example, T(X) is the median of the data gathered during that particular hour being tested, and LB, UB, C, and S are the baseline parameters computed from the three-month historical data for that corresponding hour. The constant 1.4826 in S ensures unbiasedness for certain types of distributions. (For additional prior art information on this constant see: L. Davies and U. Gather. The identification of multiple outliers. Journal of the American Statistical Association, 88:782-801, 1993.)
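The Hampel baseline parameters above can be illustrated with the Python standard library. This is a sketch, not the patented implementation: it assumes the history is a list of one value per day for the hour in question, and the names `hampel_baseline` and `hampel_alert` are hypothetical.

```python
from statistics import median


def hampel_baseline(history):
    """C = median of the history; S = 1.4826 * median absolute deviation."""
    c = median(history)
    s = 1.4826 * median([abs(y - c) for y in history])
    return c, s


def hampel_alert(sample, history, lb=-3.0, ub=3.0):
    """Flag the test hour when its median falls outside the Hampel bounds."""
    c, s = hampel_baseline(history)
    if s == 0:
        # Assumption: a zero MAD leaves the criterion undefined.
        raise ValueError("median absolute deviation is zero")
    standardized = (median(sample) - c) / s
    return not (lb <= standardized <= ub)


# One extreme value (500.0) in the history barely moves C or S.
history = [10.0, 11.0, 9.0, 10.0, 12.0, 8.0, 10.0, 500.0]
print(hampel_baseline(history))
print(hampel_alert([1.0, 2.0, 3.0], history))
```

Because both the center and the scale are medians, the single aberrant value in the example history does not widen the bounds, which is the robustness property the text describes.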

In another aspect of the invention, the quantiles test may be used as one of the statistical tests. The quantiles test is an ordering test which allows a way of automatically flagging really small and really large data points. The quantiles test may be valuable if one wants to screen the very top portion of the data or a particular area of the data such as the top five percent of the largest files. In the quantiles test, one may compute the highest X percentile and the lowest Y percentile based on the historical data corresponding to the hour of the day that we are testing. The baseline parameters may be for example:

    LB=5^(th) percentile, UB=95^(th) percentile
    C=0
    S=1.0

Those skilled in the art will realize that the upper and lower bounds do not need to be symmetric.
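With C=0 and S=1.0, the quantiles test reduces to checking whether T(X) lies between two historical percentiles. The following sketch uses `statistics.quantiles` from the Python standard library (Python 3.8 or later); the function names, the default 5th and 95th percentiles, and the choice of the "inclusive" interpolation method are assumptions made for illustration.

```python
from statistics import quantiles


def quantile_bounds(history, lower_pct=5, upper_pct=95):
    """Return (LB, UB) as the requested percentiles of the history.
    The percentiles need not be symmetric, e.g. lower_pct=1, upper_pct=97."""
    if not (1 <= lower_pct < upper_pct <= 99):
        raise ValueError("percentiles must satisfy 1 <= lower < upper <= 99")
    # quantiles(..., n=100) returns 99 cut points; index p-1 is percentile p.
    cuts = quantiles(history, n=100, method="inclusive")
    return cuts[lower_pct - 1], cuts[upper_pct - 1]


def quantile_alert(t_x, history, lower_pct=5, upper_pct=95):
    """Flag T(X) when it falls outside the percentile bounds (C=0, S=1)."""
    lb, ub = quantile_bounds(history, lower_pct, upper_pct)
    return not (lb <= t_x <= ub)


history = [float(v) for v in range(1, 101)]
print(quantile_bounds(history))
print(quantile_alert(99.0, history), quantile_alert(50.0, history))
```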

In another aspect of the invention, tests may be based on the classical Central Limit Theorem and the sampling distribution of the sample mean. For example, such tests may include 5% trimmed mean, 5% trimmed mean log, 3-sigma and 3-sigma log tests. (For additional prior art information on these tests see: C. R. Rao. Linear Statistical Inference and Its Applications. Wiley, New York, 1973.)

In these tests we note that the mean T(X) of all data gathered during an hour has an approximately normal distribution with parameters that can be computed from the three-month historical data for the corresponding hour of day.

The phrase trimmed mean refers to the fact that one may “trim” a certain portion of the data by dropping it from the computations. Those skilled in the art will realize that it is acceptable to trim up to 10% to 20% of the data, ensuring that the bounds (baseline parameters) are not influenced by the occasional aberrant observation. (For additional prior art information see: P. J. Huber. Robust Statistics. Wiley, New York, 1981.)

The baseline parameters may be for example:

    LB=−3*K(H), UB=3*K(H)
    C=Mean
    S=Standard Error

The Standard Error is the standard deviation divided by the square root of the number of data points X_(i) (sample size) in the hour for which we are conducting the test. (For additional prior art information see: C. R. Rao. Linear Statistical Inference and Its Applications. Wiley, New York, 1973.)

The Mean and the standard deviation may be calculated from the historical data. An internal scaling constant K(H) may be used that adds a slight twist on the conventional control charts. This constant may depend on the hour of the day and is meaningful only for this application and adds an extra piece of information to the chart that may assist users. Because the same constant is applied to the test statistic T(X), the alert outcome and its interpretation remain unaltered. (For additional prior art information about classical control charts where K(H)=1 see: A. J. Duncan. Quality Control and Industrial Statistics. Irwin, Homewood, 1974.)
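The mean-based tests can be sketched as follows. Several choices here are assumptions made for illustration rather than details confirmed by the description: K(H) is fixed at 1 (the classical control chart case), trimming is applied symmetrically to the sorted history, and the function names `trim` and `mean_alert` are hypothetical. Setting `trim_fraction=0.0` gives the plain three-sigma test.

```python
from math import sqrt
from statistics import mean, stdev


def trim(values, fraction):
    """Drop the given fraction of points from each end of the sorted data."""
    data = sorted(values)
    k = int(len(data) * fraction)
    return data[k:len(data) - k] if k else data


def mean_alert(sample, history, trim_fraction=0.0):
    """Three-sigma test on the sample mean, with K(H) assumed to be 1.
    C is the (optionally trimmed) historical mean; S is the standard error,
    i.e. the historical standard deviation over sqrt(sample size)."""
    kept = trim(history, trim_fraction)
    c = mean(kept)
    s = stdev(kept) / sqrt(len(sample))
    standardized = (mean(sample) - c) / s
    return not (-3.0 <= standardized <= 3.0)


# History with two aberrant values; the test hour has shifted from ~102 to 110.
history = [100.0 + (i % 5) for i in range(40)] + [10000.0, -9000.0]
sample = [110.0] * 4
print(mean_alert(sample, history, trim_fraction=0.05))  # trimmed mean test
print(mean_alert(sample, history, trim_fraction=0.0))   # plain three-sigma
```

In this example the two aberrant historical values inflate the untrimmed standard deviation so much that the plain three-sigma test misses the shift, while the 5% trimmed version flags it.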

In another aspect of the invention, various weights may be assigned to the selected tests. The weights may be assigned by ranking them in the order of agreement either with empirical evidence from historical data of alerts or in the order of agreement with knowledge experts who label the alerts as genuine or false. Those skilled in the art will realize that other means of ranking are possible and applicable. As a default, equal weights may be assigned to each of the tests. In another aspect of the invention, various weights may be applied to the historical data. For example, more weight may be given to recent historical data and less weight to older historical data.
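One possible weighting scheme is sketched below. The description only states that tests may be ranked by agreement with expert labels or historical alerts; making each weight proportional to the agreement rate is an assumption of this sketch, and all function names are hypothetical.

```python
def equal_weights(test_names):
    """Default scheme: every test in the ensemble gets the same weight."""
    return {name: 1.0 / len(test_names) for name in test_names}


def agreement_weights(test_flags, expert_labels):
    """Weight each test by how often its past alerts matched expert labels
    (True = genuine alert). Proportional weighting is an assumption."""
    rates = {}
    for name, flags in test_flags.items():
        agree = sum(f == label for f, label in zip(flags, expert_labels))
        rates[name] = agree / len(expert_labels)
    total = sum(rates.values())
    if total == 0:
        return equal_weights(list(test_flags))
    return {name: rate / total for name, rate in rates.items()}


def ensemble_score(flags_now, weights):
    """Sum the weights of the tests that are currently flagging."""
    return sum(weights[name] for name, flagged in flags_now.items() if flagged)


past = {
    "hampel": [True, False, True, False],
    "three_sigma": [True, True, True, True],
}
labels = [True, False, True, False]
weights = agreement_weights(past, labels)
print(weights)
print(ensemble_score({"hampel": True, "three_sigma": False}, weights))
```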

In another aspect of the invention, the tests may be updated periodically due to changes in the processes that generate the data, such as increased traffic or new network elements that result in increased feed volumes and frequency. As a consequence, the statistical distributions of the data feed characteristics change as well. In an embodiment, the baseline parameters used in Equation 1 are updated when statistically significant changes are detected in the data feeds. In addition, feedback from the system may be used to validate both the ensemble of tests as well as the opinions of the experts. If a system is unaffected by or recovers rapidly with no fallout from alerts that are consistently tagged as “authentic” then the tests as well as the experts have to be re-evaluated.

FIG. 7 illustrates an exemplary switchboard 702 which shows the results of an ensemble comprising six tests. FIG. 7 is exemplary of a monitoring application where multiple feeds are received and sent to a computational cluster where they are cleaned, combined, prepared for various data mining tasks and ultimately archived. It is important in such a monitoring application to ensure that all the required feed files have arrived uncorrupted in a timely fashion.

The following example is an illustrative example of the invention and is not intended to limit the scope of the present invention. FIG. 7 displays the results of the ensemble of tests of the monitoring system to ensure the smooth flow of data, processes, data mining algorithms and results.

It is noted that log files play an important role in monitoring data feed activities and health. The log files contain data about when, where and which files were received and which processes and machines touched these files. In addition, the system may maintain a variety of metadata about the contents and the nature of the feed files. The metadata as well as the data in the files themselves may be monitored, repaired and analyzed.

Data contained in the log files may be aggregated into hourly data summaries that describe various aspects, e.g., the number of files that arrive during that hour, the total of the file sizes for that hour, the number of errors (e.g., mangled headers or mismatched checksums) and so on. As an example, the “file size” attribute may be used to illustrate the ensemble. The following discussion only discusses a single attribute to simplify the example. Those skilled in the art will realize that multiple attributes may be used and that multivariate tests may also be used.
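The aggregation of log records into hourly summaries might look like the following sketch. The record layout (arrival time, file size in bytes, error flag) and the function name `hourly_summaries` are assumptions; the description does not specify a log format.

```python
from collections import defaultdict
from datetime import datetime


def hourly_summaries(records):
    """Aggregate (arrived_at, size_bytes, had_error) log records into
    per-hour counts of files, total file size and errors."""
    summaries = defaultdict(lambda: {"files": 0, "total_size": 0, "errors": 0})
    for arrived_at, size_bytes, had_error in records:
        # Truncate the timestamp to the start of its hour.
        hour_start = arrived_at.replace(minute=0, second=0, microsecond=0)
        summary = summaries[hour_start]
        summary["files"] += 1
        summary["total_size"] += size_bytes
        summary["errors"] += int(had_error)
    return dict(summaries)


records = [
    (datetime(2005, 12, 29, 9, 5), 100, False),
    (datetime(2005, 12, 29, 9, 40), 250, True),
    (datetime(2005, 12, 29, 10, 1), 75, False),
]
for hour, summary in sorted(hourly_summaries(records).items()):
    print(hour, summary)
```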

The hourly totals may be grouped by hour of the day and are used to compute the baseline parameters to build nonparametric tests based on quantiles and means. Using nonparametric tests ensures that the tests are widely applicable, as opposed to tests based on restrictive models based on distributional assumptions.

Furthermore, in the current example the data is grouped by hour of day due to strong daily cyclical patterns. The results of the alerts are displayed on visual “switchboard” 702 that is easy to read and understand. The switchboard lights up whenever a test flags an out of bound reading. The more lights that turn on, the greater our confidence that the alert is genuine and warrants immediate attention. Furthermore, weights may be assigned to the tests based on empirical validation or agreement with experts.
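A text-only approximation of such a switchboard is sketched below, with one row per test and one column per hour. The rendering style, the `*` and `.` symbols and the function names are assumptions for illustration; the switchboard of FIG. 7 is a graphical plot.

```python
def render_switchboard(alerts):
    """Render one row per test: '*' where the test flagged, '.' otherwise.
    `alerts` maps a test name to a list of booleans, one per hour."""
    width = max(len(name) for name in alerts)
    rows = []
    for name, flags in alerts.items():
        lights = "".join("*" if flagged else "." for flagged in flags)
        rows.append(name.ljust(width) + " " + lights)
    return "\n".join(rows)


def lights_per_hour(alerts):
    """Count how many tests are lit in each hour; more lights suggest
    greater confidence that the alert is genuine."""
    return [sum(column) for column in zip(*alerts.values())]


alerts = {
    "hampel": [False, True, True],
    "quantile": [False, False, True],
}
print(render_switchboard(alerts))
print(lights_per_hour(alerts))
```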

FIGS. 1-6 show the results of using an ensemble of six tests for monitoring the exemplary data feeds based on Hampel bounds, quantiles (95% and 5% bounds), trimmed mean bounds and three sigma bounds tests applied to average file size and average of log transformed file size data. Those skilled in the art will realize that more sophisticated tests based on multiple attributes, or tests that capture temporal patterns or detect changes in data streams may be used. Other variants to include in the ensemble are obtained by changing the window of historical data from three months or by varying the weights (more weight to recent data, less as we go farther back in time).

FIG. 1 shows the hourly readings for a one-week period (hours 1 to 168) based on the Hampel bounds test 1002. FIG. 1 indicates when the data is below or above the baseline parameters. Because the bounds are constant, the chart resembles process control charts which are easy to read. As mentioned above, the Hampel identifier test is a robust test that is not unduly affected by one single bad observation in the three-month history used to compute the Hampel bounds 1002. In FIG. 1, the upper bounds 1004, the lower bounds 1006, and the expected values 1008 are indicated by solid lines. As shown in FIG. 1, the total file sizes are well behaved initially but go below the lower bounds 1006, then return briefly to within bounds and then rapidly go below the lower bounds 1006 again. Clearly the data indicates a problem. It may be seen in FIG. 7 that whenever the Hampel test indicates an outlier, a dot is plotted at the corresponding time period on the switchboard with a test alert value=1 (704).

FIG. 2 shows the upper and lower bounds based on the 95^(th) and 5^(th) percentiles, respectively, for use in the quantiles test 2002. The quantiles test 2002 is useful when we know a priori that we want to examine, for example, the 10% of our most extreme data, irrespective of whether it is within acceptable limits or not. For instance, one might want to monitor the biggest and smallest files for duplication and completeness respectively. Similarly, in other contexts, one might want to monitor the network elements that handle the least and most traffic at any given point in time.

The upper and lower bounds do not have to be symmetric. For example, the 1^(st) and 97^(th) percentiles may be selected, if one knows that the distribution is skewed. Whenever the quantiles test flags outliers, a dot is plotted at the corresponding time period with a test alert value=2 (706) on the switchboard in FIG. 7.

FIG. 3 and FIG. 4 are based on a log-transform of the data. In particular, FIG. 3 is based on 5% trimmed mean bounds (3002) and FIG. 4 is based on three sigma bounds (4002). In FIG. 3, the upper bounds 3004, the lower bounds 3006, and the expected values 3008 are indicated by solid lines. Similarly, in FIG. 4, the upper bounds 4004, the lower bounds 4006, and the expected values 4008 are indicated by solid lines. These tests may be effective when a user wants to flag outliers measured in magnitudes rather than simple standard deviations. The transformation is convenient for long tailed distributions like the log-normal. However, because we have an ensemble to capture a wide set of scenarios, we do not need to anticipate the distribution of the data. Whenever the 5% trimmed mean log test flags outliers, a dot is plotted at the corresponding time period with a test alert value=3 (708) on the switchboard in FIG. 7. Similarly, whenever the 3 sigma log test flags outliers, a dot is plotted at the corresponding time period with a test alert value=4 (710) on the switchboard in FIG. 7.

Finally, the last two exemplary tests shown in FIGS. 5 and 6 are the bounds computed from a trimmed average (FIG. 5, 5002) and 3 sigma average (FIG. 6, 6002) of the untransformed data. In FIG. 5, the upper bounds 5004, the lower bounds 5006, and the expected values 5008 are indicated by solid lines. Similarly, in FIG. 6, the upper bounds 6004, the lower bounds 6006, and the expected values 6008 are indicated by solid lines. Whenever the 5% trimmed mean test flags outliers, a dot is plotted at the corresponding time period with a test alert value=5 (712) on the switchboard in FIG. 7. Similarly, whenever the 3 sigma test flags outliers, a dot is plotted at the corresponding time period with a test alert value=6 (714) on the switchboard in FIG. 7.

As discussed above, FIG. 7 shows the switchboard with dots corresponding to various test alerts. Those skilled in the art will realize that switchboard 702 may take visual forms other than the shown graph and cover various time periods of interest to the user. In FIG. 7, the switchboard 702 is relatively peaceful and blank towards the early part of the week, with no dots to indicate alerts that the feed is out of control. It is clear that during the latter part of the week, there is a persistent problem with the feed as indicated by all the tests lighting up. In addition, FIG. 7 also indicates that some of the exemplary tests used in the example may be more sensitive than others. For example, in FIG. 7, the Hampel test (Test Alert Value=1) and the Quantile test (Test Alert Value=2) are both set off. It is interesting that the Hampel test flags alerts, indicating that there is a strong wholesale movement in the data. This is appropriate as one can see from the data. However, the tests based on the log transforms (Test Alert Value=3 (trimmed mean) and Test Alert Value=4 (mean)) are more conservative. These tests are useful when one is looking for shifts in orders of magnitude rather than a simple standard deviation away. Finally, the 3-Sigma charts based on the 5% trimmed mean (Test Alert Value=5) and mean (Test Alert Value=6) are closely tied to the data and light up quite frequently. Again, the fact that the trimmed mean test flags alerts indicates that the shift in the data is quite significant.

FIG. 9 illustrates a flow diagram for monitoring multiple data feeds for abnormalities in accordance with an aspect of the invention. In FIG. 9 at step 902, at least two statistical tests are selected that will be used in the ensemble of tests. The statistical tests may be chosen based on the nature of the monitoring task. For example, the monitoring task may be to monitor data feeds from various call centers that are received at a central location. The feeds may include call records for various calls completed by customers. The call records may be forwarded to the central location for creation of accurate billing statements. In an alternative scenario, the at least two statistical tests may be selected to monitor credit card transactions that are being forwarded from numerous locations to a central location for creation of billing statements. Those skilled in the art will realize that the nature of the monitoring task will assist in the proper selection of the multitude of statistical tests to be used in the ensemble of tests.

In step 904, the bounds for the ensemble of tests may be determined. The bounds may be determined based on Equation 1 disclosed above. Next, in step 906 each of the selected tests in the ensemble of tests may be weighted. The selection of the weight may be determined based on historical data or expert experience. In step 908, the ensemble of tests is applied to the various data feeds to monitor for abnormalities.

Next, in step 910, the results of the applied ensemble of tests may be displayed. The display may take the form of a switchboard as shown in FIG. 7. In step 912 the tests used in the ensemble of tests may be periodically validated. The validation may be used to provide feedback (step 914) to a user in the selection of tests in step 902.

While the invention has been described with respect to specific examples including presently preferred modes of carrying out the invention, those skilled in the art will appreciate that there are numerous variations and permutations of the above described systems and techniques that fall within the spirit and scope of the invention.

1. A method of monitoring multiple data feeds for abnormalities, the method comprising: selecting at least two statistical tests to form an ensemble of statistical tests; applying the ensemble of statistical tests to the multiple data feeds; and displaying the results of the at least two statistical tests to determine abnormalities.